Skip to content

[branch-52] Fix Arrow Spill Underrun (#20159)#20684

Merged
alamb merged 1 commit intoapache:branch-52from
hareshkh:repartition_spill_buffering
Mar 4, 2026
Merged

[branch-52] Fix Arrow Spill Underrun (#20159)#20684
alamb merged 1 commit intoapache:branch-52from
hareshkh:repartition_spill_buffering

Conversation

@hareshkh
Copy link
Contributor

@hareshkh hareshkh commented Mar 3, 2026

Which issue does this PR close?

Rationale for this change

This adjusts the way that the spill channel works. Currently we have a spill writer & reader pairing which uses a mutex to coordindate when a file is ready to be read.

What happens is, that because we were using a spawn_buffered call, the read task would race ahead trying to read a file which is yet to be written out completely.

Alongside this, we need to flush each write to the file, as there is a chance that another thread may see stale data.

What changes are included in this PR?

Adds a flush on write, and converts the read task to not buffer reads.

Are these changes tested?

I haven't written a test, but I have been running the example in the attached issue. While it now fails with allocation errors, the original error goes away.

Are there any user-facing changes?

Nope

## Which issue does this PR close?

- Closes apache#19425

## Rationale for this change

This adjusts the way that the spill channel works. Currently we have a
spill writer & reader pairing which uses a mutex to coordindate when a
file is ready to be read.

What happens is, that because we were using a `spawn_buffered` call, the
read task would race ahead trying to read a file which is yet to be
written out completely.

Alongside this, we need to flush each write to the file, as there is a
chance that another thread may see stale data.

## What changes are included in this PR?

Adds a flush on write, and converts the read task to not buffer reads.

## Are these changes tested?

I haven't written a test, but I have been running the example in the
attached issue. While it now fails with allocation errors, the original
error goes away.

## Are there any user-facing changes?

Nope
@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Mar 3, 2026
@hareshkh
Copy link
Contributor Author

hareshkh commented Mar 4, 2026

@alamb @adriangb : Can you please take a look at this? (You reviewed the original PR)

@hareshkh hareshkh changed the title Fix Arrow Spill Underrun (#20159) [branch-52] Fix Arrow Spill Underrun (#20159) Mar 4, 2026
@adriangb
Copy link
Contributor

adriangb commented Mar 4, 2026

Could you link in the comment description (so it ends up in the commit) to the original PR, and how you created this diff (e.g. "cherry picked abc123")?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @hareshkh and @adriangb

I linked the PR that is being backported and I verified the diff is the same, and removed the duplicated template in the PR description

I also ran the problem query and verified that that before this PR , it fails like this:

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ~/Software/datafusion-cli/datafusion-cli-52.2.0  -m 1G -c "SELECT \"UserID\", extract(minute FROM to_timestamp_seconds(\"EventTime\")) AS m, \"SearchPhrase\", COUNT(*) FROM '/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned' GROUP BY \"UserID\", m, \"SearchPhrase\" ORDER BY COUNT(*) DESC LIMIT 10;"
DataFusion CLI v52.2.0
Error: Arrow error: Io error: failed to fill whole buffer

And after this PR it succeeds

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ ./target/profiling/datafusion-cli  -m 1G -c "SELECT \"UserID\", extract(minute FROM to_timestamp_seconds(\"EventTime\")) AS m, \"SearchPhrase\", COUNT(*) FROM '/Users/andrewlamb/Software/datafusion/benchmarks/data/hits_partitioned' GROUP BY \"UserID\", m, \"SearchPhrase\" ORDER BY COUNT(*) DESC LIMIT 10;"
DataFusion CLI v52.2.0
+---------------------+----+--------------+----------+
| UserID              | m  | SearchPhrase | count(*) |
+---------------------+----+--------------+----------+
| 1313338681122956954 | 31 |              | 589      |
| 1313338681122956954 | 28 |              | 578      |
| 1313338681122956954 | 29 |              | 572      |
| 1313338681122956954 | 33 |              | 567      |
| 1313338681122956954 | 27 |              | 557      |
| 1313338681122956954 | 32 |              | 554      |
| 1313338681122956954 | 30 |              | 552      |
| 1313338681122956954 | 34 |              | 546      |
| 1313338681122956954 | 26 |              | 540      |
| 1313338681122956954 | 10 |              | 539      |
+---------------------+----+--------------+----------+
10 row(s) fetched.
Elapsed 184.374 seconds.

@alamb alamb merged commit 9a67de5 into apache:branch-52 Mar 4, 2026
38 of 39 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 4, 2026

Thanks again @hareshkh

@hareshkh hareshkh deleted the repartition_spill_buffering branch March 4, 2026 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants